1,130 research outputs found
Three-Dimensional Phylogeny Explorer: Distinguishing paralogs, lateral transfer, and violation of "molecular clock" assumption with 3D visualization
Background: Construction and interpretation of phylogenetic trees has been a major research topic for understanding the evolution of genes. Increases in sequence data and complexity are creating a need for more powerful and insightful tree visualization tools. Results: We have developed 3D Phylogeny Explorer (3DPE), a novel phylogeny tree viewer that maps trees onto three spatial axes (species on the X-axis; paralogs on Z; evolutionary distance on Y), enabling one to distinguish at a glance evolutionary features such as speciation; gene duplication and paralog evolution; lateral gene transfer; and violation of the "molecular clock" assumption. Users can input any tree on the online 3DPE, then rotate, scroll, rescale, and explore it interactively as "live" 3D views. All objects in 3DPE are clickable to display subtrees, connectivity path highlighting, sequence alignments, gene summary views, and more. To illustrate the value of this visualization approach for microbial genomes, we also generated 3D phylogeny analyses for all clusters from the public COG database. We constructed tree views using well-established methods and graph algorithms. We used Scientific Python to generate VRML2 3D views viewable in any web browser. Conclusion: 3DPE provides a novel method for projecting phylogenetic trees into 3D space, together with a web-based implementation with live 3D features for reconstruction of phylogenetic trees of the COG database.
Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots
Although taxonomy is often used informally to evaluate the results of
phylogenetic inference and find the root of phylogenetic trees, algorithmic
methods to do so are lacking. In this paper we formalize these procedures and
develop algorithms to solve the relevant problems. In particular, we introduce
a new algorithm that solves a "subcoloring" problem for expressing the
difference between the taxonomy and phylogeny at a given rank. This algorithm
improves upon the current best algorithm in terms of asymptotic complexity for
the parameter regime of interest; we also describe a branch-and-bound algorithm
that saves orders of magnitude in computation on real data sets. We also
develop a formalism and an algorithm for rooting phylogenetic trees according
to a taxonomy. All of these algorithms are implemented in freely-available
software. Comment: Version submitted to Algorithms for Molecular Biology. A number of
fixes from previous versio
On strongly chordal graphs that are not leaf powers
A common task in phylogenetics is to find an evolutionary tree representing
proximity relationships between species. This motivates the notion of leaf
powers: a graph G = (V, E) is a leaf power if there exist a tree T on leafset V
and a threshold k such that uv is an edge if and only if the distance between u
and v in T is at most k. Characterizing leaf powers is a challenging open
problem, along with determining the complexity of their recognition. This is in
part due to the fact that few graphs are known to not be leaf powers, as such
graphs are difficult to construct. Recently, Nevries and Rosenke asked if leaf
powers could be characterized by strong chordality and a finite set of
forbidden subgraphs.
In this paper, we provide a negative answer to this question, by exhibiting
an infinite family \G of (minimal) strongly chordal graphs that are not leaf
powers. During the process, we establish a connection between leaf powers,
alternating cycles and quartet compatibility. We also show that deciding if a
chordal graph is \G-free is NP-complete, which may provide insight on the
complexity of the leaf power recognition problem.
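The leaf power definition above can be made concrete with a small sketch. Given a tree T and a threshold k, the function below builds the graph whose vertices are the leaves of T and whose edges join leaves at distance at most k. The adjacency-dict representation, the function name `leaf_power`, and the example tree are assumptions made for illustration; they are not taken from the paper. Note that this is the easy direction (tree to graph); the recognition problem discussed in the abstract is the reverse.

```python
from collections import deque


def leaf_power(tree_adj, k):
    """Return the edges of the k-leaf power of a tree.

    tree_adj maps each node to the list of its neighbours. The result is a
    set of frozensets {u, v}, one per pair of leaves at distance <= k.
    """
    leaves = [v for v, nbrs in tree_adj.items() if len(nbrs) == 1]
    edges = set()
    for u in leaves:
        # Breadth-first search from u, never expanding beyond depth k.
        dist = {u: 0}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            if dist[x] == k:
                continue
            for y in tree_adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for v in leaves:
            if v != u and v in dist:
                edges.add(frozenset((u, v)))
    return edges


# Hypothetical tree: leaves a, b hang off internal node x; c, d hang off y.
tree = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["c", "d", "x"],
}

two_power = leaf_power(tree, 2)    # only sibling leaves are adjacent
three_power = leaf_power(tree, 3)  # every pair of leaves is adjacent
```

In this example tree, sibling leaves are at distance 2 and all other leaf pairs at distance 3, so k = 2 yields two disjoint edges and k = 3 yields the complete graph on the four leaves.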
IsoBase: a database of functionally related proteins across PPI networks
We describe IsoBase, a database identifying functionally related proteins across five major eukaryotic model organisms: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus and Homo sapiens. Nearly all existing algorithms for orthology detection are based on sequence comparison. Although these have been successful in orthology prediction to some extent, we seek to go beyond these methods by integrating sequence data and protein–protein interaction (PPI) networks to help identify true functionally related proteins. With that motivation, we introduce IsoBase, the first publicly available ortholog database that focuses on functionally related proteins. The groupings were computed using the IsoRankN algorithm, which uses spectral methods to combine sequence and PPI data and produce clusters of functionally related proteins. These clusters compare favorably with those from existing approaches: proteins within an IsoBase cluster are more likely to share similar Gene Ontology (GO) annotation. A total of 48 120 proteins were clustered into 12 693 functionally related groups. The IsoBase database may be browsed for functionally related proteins across two or more species and may also be queried by accession numbers, species-specific identifiers, gene name or keyword. The database is freely available for download at http://isobase.csail.mit.edu/. Funding: National Institute of General Medical Sciences (U.S.) (Grant Number 1R01GM081871); Fannie and John Hertz Foundation; National Science Foundation (U.S.) (NSF MSPRF); National Science Council of Taiwan (NSC99-2218-E-007-010); National Institutes of Health (U.S.) (1R01GM081871)
Haemophilus Influenzae Microarrays: Virulence and Vaccines
In 1995 the genome sequence of the Haemophilus influenzae KW20 (Rd) strain was published, the first available for a free-living organism. The genome has been
invaluable in global strategies to identify certain virulence-related genes, e.g. those
involved in LPS synthesis, and also essential genes, but there is a paucity of whole-genome
transcriptome studies. We have now constructed a whole-genome array
consisting of genes from Rd, additional genes identified in other strains of H.
influenzae and controls (from eukaryotic sources and other bacteria). We intend to
use this array in studies aimed at understanding the bacterium’s basic metabolism
and its response to changing environments; deciphering global regulatory networks
(by comparison of wild-type and mutant strains); and identifying genes expressed
in vivo. The use of H. influenzae DNA arrays combined with proteomic approaches
will enhance our understanding of the metabolism and virulence of the organism.
Additionally, sequencing of the genome of a non-typable H. influenzae strain is in progress.
The sequence from this isolate will be invaluable not only in identifying potential
novel antibiotic targets and putative vaccine candidates but also in the design of a
microarray for genome-typing purposes.
Statistically validated networks in bipartite complex systems
Many complex systems present an intrinsic bipartite nature and are often
described and modeled in terms of networks [1-5]. Examples include movies and
actors [1, 2, 4], authors and scientific papers [6-9], email accounts and
emails [10], plants and animals that pollinate them [11, 12]. Bipartite
networks are often very heterogeneous in the number of relationships that the
elements of one set establish with the elements of the other set. When one
constructs a projected network with nodes from only one set, the system
heterogeneity makes it very difficult to identify preferential links between
the elements. Here we introduce an unsupervised method to statistically
validate each link of the projected network against a null hypothesis taking
into account the heterogeneity of the system. We apply our method to three
different systems, namely the set of clusters of orthologous genes (COG) in
completely sequenced genomes [13, 14], a set of daily returns of 500 US
financial stocks, and the set of world movies of the IMDb database [15]. In all
these systems, both different in size and level of heterogeneity, we find that
our method is able to detect network structures which are informative about the
system and are not simply expression of its heterogeneity. Specifically, our
method (i) identifies the preferential relationships between the elements, (ii)
naturally highlights the clustered structure of investigated systems, and (iii)
allows links to be classified according to the type of statistically validated
relationships between the connected nodes. Comment: Main text: 13 pages, 3 figures, and 1 Table. Supplementary
information: 15 pages, 3 figures, and 2 Tables
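As a rough illustration of validating projected links against a heterogeneity-aware null hypothesis, the sketch below tests each pair of nodes with a hypergeometric tail probability and a Bonferroni-corrected threshold. The abstract does not name the null model or the multiple-testing correction, so both are assumptions here, as are the function names, the `alpha` default, and the toy data.

```python
from math import comb


def cooccurrence_pvalue(N, n_i, n_j, n_ij):
    """P(X >= n_ij) when node j's n_j links are drawn at random among N
    elements, n_i of which are linked to node i (hypergeometric tail)."""
    total = comb(N, n_j)
    tail = sum(
        comb(n_i, x) * comb(N - n_i, n_j - x)
        for x in range(n_ij, min(n_i, n_j) + 1)
    )
    return tail / total


def validated_links(memberships, N, alpha=0.01):
    """Return the node pairs whose co-occurrence passes the corrected test.

    memberships maps each node of the projected set to the set of elements
    of the other set it is linked to; N is the size of that other set.
    """
    nodes = sorted(memberships)
    pairs = [(a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]]
    if not pairs:
        return []
    threshold = alpha / len(pairs)  # Bonferroni correction (assumed)
    kept = []
    for a, b in pairs:
        shared = len(memberships[a] & memberships[b])
        p = cooccurrence_pvalue(N, len(memberships[a]), len(memberships[b]), shared)
        if p < threshold:
            kept.append((a, b))
    return kept


# Toy data (hypothetical): three nodes linked to elements numbered 0..19.
toy = {
    "g1": set(range(0, 5)),
    "g2": set(range(0, 5)),
    "g3": set(range(10, 15)),
}
links = validated_links(toy, N=20)
```

Because the test conditions on each node's degree, a pair of highly connected nodes needs proportionally more shared neighbours to be validated than a pair of sparsely connected ones.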
ProOpDB: Prokaryotic Operon DataBase
The Prokaryotic Operon DataBase (ProOpDB, http://operons.ibt.unam.mx/OperonPredictor) constitutes one of the most precise and complete repositories of operon predictions now available. Using our novel and highly accurate operon identification algorithm, we have predicted the operon structures of more than 1200 prokaryotic genomes. ProOpDB offers several ways to retrieve a set of operon predictions, including by: (i) organism name, (ii) metabolic pathways, as defined by the KEGG database, (iii) gene orthology, as defined by the COG database, (iv) conserved protein domains, as defined by the Pfam database, (v) reference gene and (vi) reference operon, among others. In order to limit the operon output to non-redundant organisms, ProOpDB offers an efficient method to select the most representative organisms based on a precompiled phylogenetic distance matrix. In addition, the ProOpDB operon predictions are used directly as the input data of our Gene Context Tool to visualize their genomic context and retrieve the sequence of their corresponding 5′ regulatory regions, as well as the nucleotide or amino acid sequences of their genes.
Pseudomonas Genome Database: facilitating user-friendly, comprehensive comparisons of microbial genomes
Pseudomonas aeruginosa is a well-studied opportunistic pathogen that is particularly known for its intrinsic antimicrobial resistance, diverse metabolic capacity, and its ability to cause life-threatening infections in cystic fibrosis patients. The Pseudomonas Genome Database (http://www.pseudomonas.com) was originally developed as a resource for peer-reviewed, continually updated annotation for the Pseudomonas aeruginosa PAO1 reference strain genome. In order to facilitate cross-strain and cross-species genome comparisons with other Pseudomonas species of importance, we have now expanded the database capabilities to include all Pseudomonas species, and have developed or incorporated methods to facilitate high quality comparative genomics. The database contains robust assessment of orthologs, a novel ortholog clustering method, and incorporates five views of the data at the sequence and annotation levels (Gbrowse, Mauve and custom views) to facilitate genome comparisons. A choice of simple and more flexible user-friendly Boolean search features allows researchers to search and compare annotations or sequences within or between genomes. Other features include more accurate protein subcellular localization predictions and a user-friendly, Boolean searchable log file of updates for the reference strain PAO1. This database aims to continue to provide a high quality, annotated genome resource for the research community and is available under an open source license.
Partial Homology Relations - Satisfiability in terms of Di-Cographs
Directed cographs (di-cographs) play a crucial role in the reconstruction of
evolutionary histories of genes based on homology relations which are binary
relations between genes. A variety of methods based on pairwise sequence
comparisons can be used to infer such homology relations (e.g. orthology,
paralogy, xenology). They are satisfiable if the relations can be
explained by an event-labeled gene tree, i.e., they can simultaneously co-exist
in an evolutionary history of the underlying genes. Every gene tree is
equivalently interpreted as a so-called cotree that entirely encodes the
structure of a di-cograph. Thus, satisfiable homology relations must
necessarily form a di-cograph. The inferred homology relations might not cover
each pair of genes and thus, provide only partial knowledge on the full set of
homology relations. Moreover, for particular pairs of genes, it might be known
with a high degree of certainty that they are not orthologs (resp. paralogs,
xenologs) which yields forbidden pairs of genes. Motivated by this observation,
we characterize (partial) satisfiable homology relations with or without
forbidden gene pairs, provide a quadratic-time algorithm for their recognition
and for the computation of a cotree that explains the given relations.
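To illustrate how an event-labeled gene tree explains homology relations, the sketch below reads the relation of every gene pair off the label of their lowest common ancestor. It assumes the usual convention, not spelled out in the abstract, that a pair is orthologous when that ancestor is a speciation and paralogous when it is a duplication. It covers only these symmetric relations and omits the directed xenology case that makes the graphs di-cographs. The nested-tuple tree encoding and all names are hypothetical.

```python
def leaves(node):
    """List the gene names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]


def relations(node, out=None):
    """Map each unordered gene pair to the event label of its lowest
    common ancestor in the tree rooted at node."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    label, children = node
    child_leaves = [leaves(child) for child in children]
    # Genes in different child subtrees have this node as their LCA.
    for i in range(len(children)):
        for j in range(i + 1, len(children)):
            for u in child_leaves[i]:
                for v in child_leaves[j]:
                    out[frozenset((u, v))] = label
    for child in children:
        relations(child, out)
    return out


# Hypothetical gene tree: a duplication in species A, then a split from B.
gene_tree = ("speciation", [("duplication", ["a1", "a2"]), "b1"])
rel = relations(gene_tree)
```

Here `a1` and `a2` come out as paralogs, and each is an ortholog of `b1`. The satisfiability question in the abstract runs the other way: given such pairwise relations, possibly partial, decide whether any tree of this kind produces them.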